The data has already been scraped from realestate.com.au and saved as a CSV file containing about 870,000 rows.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import datetime
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('SOLD_complete.csv')
df.head()
df.describe()
df.info()
df.agency.value_counts()
df.agent_name.value_counts()
df.property_type.value_counts()
df.suburb.value_counts()
df[df[['address','sold_date']].duplicated()]
df[df.address[9357] == df.address]
df.agent_title.value_counts()
df.price.head()
df.address.value_counts()[:20]
df.iloc[:,:10].duplicated().sum()
df[['address', 'sold_date']].duplicated().sum()
clean = df.copy()
I will drop the rows without a sold date and then convert the rest to datetime format.
clean.drop(clean[clean.sold_date.isnull()].index,axis=0, inplace=True)
clean.sold_date = pd.to_datetime(clean.sold_date.apply(lambda x : x.split('on ')[1:][0]))
clean.head()
clean.info()
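As a sanity check on the parsing rule, here's how the split-on-'on ' trick behaves on a couple of hypothetical raw strings (the real scraped format may differ slightly); `errors='coerce'` is a safe extra in case any string doesn't parse:

```python
import pandas as pd

# Hypothetical raw strings in the style of the scraped sold_date column.
raw = pd.Series(["Sold on 12 Mar 2018", "Sold on 01 Jul 2015", None])

# Keep only the part after 'on ', then let pandas parse it.
# errors='coerce' turns anything unparseable into NaT instead of raising.
dates = pd.to_datetime(raw.dropna().str.split('on ').str[-1], errors='coerce')
print(dates.dt.year.tolist())  # [2018, 2015]
```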
The values in the suburb column come from the suburb name searched during the scraping process. Some of these names are greater regions that contain other suburbs in their search results. To fix this I will extract the suburb name from the address column and use it to replace the suburb value.
clean.suburb = clean.address.apply(lambda x : x.split(', ')[-1])
clean.head()
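To illustrate the extraction rule on hypothetical addresses (the cleaning step above assumes the suburb is the final comma-separated component of the address string):

```python
import pandas as pd

# Hypothetical addresses; the suburb is taken to be the last
# comma-separated chunk, matching the cleaning rule above.
addr = pd.Series(["5 Example St, Melton", "2/10 Sample Rd, St Albans"])
suburbs = addr.str.split(', ').str[-1]
print(suburbs.tolist())  # ['Melton', 'St Albans']
```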
Agency names combine the real estate agency's name with the branch's suburb; I'll extract just the agency name from the string.
clean['agency'] = clean.agency.apply(lambda x : x.split(' - ')[0])
clean.head()
It seems the agent_title column doesn't provide any valuable information, so I'll drop it along with the time column.
clean.drop(['agent_title','time'],axis=1,inplace=True)
clean.info()
The beds, baths, cars and price columns were extracted as strings. To do analysis and calculations on these numbers I'll convert them to integers, setting rows with no value to 0.
clean.beds = clean.beds.apply(lambda x : x.replace(' ',''))
clean.baths = clean.baths.apply(lambda x : x.replace(' ',''))
clean.cars = clean.cars.apply(lambda x : x.replace(' ',''))
clean.beds = clean.beds.apply(lambda x : int(x) if x!='None' else 0)
clean.baths = clean.baths.apply(lambda x : int(x) if x!='None' else 0)
clean.cars = clean.cars.apply(lambda x : int(x) if x!='None' else 0)
clean.price = clean.price.apply(lambda x: x.replace('$',''))
clean.price = clean.price.apply(lambda x :x.replace(',',''))
clean.price = clean.price.apply(lambda x : 0 if x=='Contact agent' else int(x))
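The same conversions can be done in fewer passes with `pd.to_numeric`, which coerces anything non-numeric (such as 'None' or 'Contact agent') to NaN in one go; a sketch on made-up values:

```python
import pandas as pd

# Made-up values mimicking the raw beds and price columns.
df = pd.DataFrame({
    'beds': [' 3', 'None', '2 '],
    'price': ['$450,000', 'Contact agent', '$1,200,000'],
})

# errors='coerce' maps 'None' to NaN, which is then filled with 0.
df['beds'] = pd.to_numeric(df.beds.str.strip(), errors='coerce').fillna(0).astype(int)

# Strip '$' and ',' first; 'Contact agent' becomes NaN and then 0.
df['price'] = (pd.to_numeric(df.price.str.replace('[$,]', '', regex=True),
                             errors='coerce').fillna(0).astype(int))
print(df.beds.tolist(), df.price.tolist())  # [3, 0, 2] [450000, 0, 1200000]
```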
clean.info()
clean.describe()
Some rows do not have a valid address, I'll drop these rows from the dataset.
clean[clean.address.apply(lambda x : 'Address available' in x)].count()
clean.drop(clean[clean.address.apply(lambda x : 'Address available' in x)].index,axis= 0, inplace=True)
clean.address.value_counts()[:10]
Now that the dataset is cleaner I can do a basic assessment of outliers. I'll visually inspect the suspicious rows and spot-check their listing pages on realestate.com.au.
clean[clean.beds > 10]
It appears that some listings report large numbers of bedrooms, bathrooms and car parks that are not actually real.
In other cases the property is a block of units and the numbers refer to the totals across all units, not to each individual unit.
I think property_type is a good indicator for deciding whether these values are real.
Since the purpose of this analysis is to examine the general property market in the state of Victoria, Australia, I'll analyze the dataset mainly using the common property types: House, Townhouse, Unit and Apartment.
clean[clean.price > 10000000]
Looking at the most expensive listings, it is obvious that many of the prices are not real; I confirmed this by checking some of the properties on the website.
Some prices also accidentally have one or two extra zeros.
I will drop the top 0.1% of rows by beds, baths, car parks and price to normalize the dataset across all property types.
clean.quantile(.999, numeric_only=True)
clean.drop(clean[(clean.baths>5)].index, axis=0, inplace=True)
clean.drop(clean[(clean.beds>7)].index, axis=0, inplace=True)
clean.drop(clean[(clean.cars>11)].index, axis=0, inplace=True)
clean.drop(clean[(clean.price>3120000)].index, axis=0, inplace=True)
clean.describe()
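The same trimming can be written as a single mask computed from the 0.999 quantiles, which avoids hard-coding the thresholds; a sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data with a long right tail, like the price column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'beds': rng.integers(1, 12, 5000),
    'price': rng.lognormal(13, 0.6, 5000).round(),
})

# One cap per column, then keep only the rows under every cap.
caps = df.quantile(0.999, numeric_only=True)
trimmed = df[(df.beds <= caps.beds) & (df.price <= caps.price)]
print(len(trimmed) <= len(df))  # True
```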
To better analyze the dataset we'll need more granular date values such as month, year and month-of-year. Having separate columns for these, instead of computing them for every chart, makes life much easier.
clean['year'] = clean.sold_date.dt.year
clean['month'] = clean.sold_date.dt.month
clean['month_year'] = clean.sold_date.dt.to_period('M')
Finally I'll remove the duplicate rows. During data gathering I scraped some suburbs twice because I didn't get all their properties on the first attempt, then merged the datasets, so there are plenty of duplicates.
Besides the rows I scraped twice, some listings are repeated on realestate.com.au itself; these are sometimes identical and sometimes differ in details such as beds, agency or price. To recognise them I assess the dataset on the property address and the date it was sold, as these two values determine whether a row is a duplicate.
I'll drop the obvious duplicates first.
clean.drop(clean[clean.duplicated()].index,axis=0, inplace=True)
clean[['address', 'sold_date']].duplicated().sum()
The table below shows some of the duplicated rows: some have different prices with the exact same description, some differ in property_type, and others differ in the number of beds/baths/cars.
clean[clean['address'].isin(clean['address'][clean[['address', 'sold_date']].duplicated()])].sort_values('sold_date')
Finding a pattern or algorithm to clean these rows neatly would be hard. Since there are only 1722 of them compared to over 700,000 rows in the dataset, I'll simply drop the duplicates without worrying about which row is kept and which is deleted.
clean.drop(clean[clean[['address','sold_date']].duplicated()].index, axis=0, inplace=True)
clean[['address','sold_date']].duplicated().sum()
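For reference, the same keep-one-arbitrary behaviour is what `drop_duplicates` gives directly; a sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical duplicated listing: same address and sold_date, different price.
df = pd.DataFrame({
    'address': ['1 A St, Melton', '1 A St, Melton', '2 B St, Ararat'],
    'sold_date': ['2018-03-12', '2018-03-12', '2018-03-12'],
    'price': [450000, 455000, 300000],
})

# keep='first' retains the first occurrence of each address/sold_date pair.
deduped = df.drop_duplicates(subset=['address', 'sold_date'], keep='first')
print(deduped.price.tolist())  # [450000, 300000]
```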
clean.to_csv('SOLD_complete_cleaned.csv')
sb.countplot(data=clean, x='property_type',color='gray', order=clean.property_type.value_counts().index)
plt.xticks(rotation=90);
3-bedroom properties are the most common type sold on the website, followed by 4- and 2-bedroom properties.
plt.hist(data=clean, x='beds', bins=np.arange(.5,8.5,1), rwidth=.9)
plt.xlabel('Number of bedrooms')
plt.title('Bedrooms');
The majority of properties have 1 or 2 bathrooms.
plt.hist(data=clean, x='baths', bins=np.arange(.5,7.5,1), rwidth=.9);
plt.xlabel('Number of bathrooms')
plt.title('Bathrooms');
Most properties have 1 or 2 car park spaces. The values for 3 or more car parks may be less accurate, as the definition of a car park varies with the advertiser's perspective: some count the maximum number of cars that can be parked on the property, others count only the designed car parks and garages.
plt.hist(data=clean, x='cars', bins=np.arange(.5,9.5,1), rwidth=.9);
plt.xlabel('Number of car parks')
plt.title('Car parks');
sb.boxplot(data = clean[(clean.beds>0) & (clean.price>0)], x = 'beds', y = 'price', color='gray');
sb.boxplot(data = clean[(clean.baths>0) & (clean.price>0)], x = 'baths', y = 'price', color='gray');
sb.boxplot(data = clean[(clean.cars>0) & (clean.price>0)], x = 'cars', y = 'price', color='gray');
#clean[clean.price>0].groupby(['property_type']).price.mean().sort_values(ascending=False)
plt.barh(y=clean[clean.price>0].groupby(['property_type']).price.mean().sort_values(ascending=False).index,
width=clean[clean.price>0].groupby(['property_type']).price.mean().sort_values(ascending=False))
plt.xticks(rotation=30);
plt.title('Average price of different property types')
plt.xlabel('Price');
It's interesting that the average townhouse price exceeds the average house price, probably because townhouses are built in more expensive suburbs where property prices are high enough to make building multiple townhouses instead of one house economical.
The figure below shows the price distribution. As the majority of properties are priced under 3 million, we'll zoom in a bit to see what's going on.
plt.hist(data=clean, x='price', bins=np.arange(1000,5000000,100000), rwidth=.8);
plt.xlabel('Price')
plt.title('Price range');
plt.hist(data=clean, x ='price', bins=np.arange(100000, 3000000, 100000), rwidth=.9);
plt.xlabel('Price')
plt.title('Price range');
The largest price bin across all properties is 350,000 to 450,000 and the distribution is skewed to the right.
Interestingly there is a gap in the 700,000 to 800,000 price range; it appears the 700s are not a popular bracket to price a property.
plt.hist(data=clean, x ='price', bins=np.arange(100000, 1500000, 50000), rwidth=.9);
plt.xlabel('Price')
plt.title('Price range');
plt.barh(y= clean.agency.value_counts()[:20].index, width=clean.agency.value_counts()[:20])
plt.title('Top agencies based on the number of properties sold')
plt.xlabel('Count');
Ray White and Barry Plant lead by a large margin and are the top real estate agencies in Victoria.
plt.barh(y=clean[clean.price>0].groupby(['agency']).price.mean().sort_values(ascending=False)[:20].index,
         width=clean[clean.price>0].groupby(['agency']).price.mean().sort_values(ascending=False)[:20]);
plt.title('Average property price sold by top agencies')
plt.xlabel('Price');
This chart excludes properties with no sold price, therefore the metrics differ from the previous chart.
plt.barh(y= clean.agent_name.value_counts()[2:20].index, width=clean.agent_name.value_counts()[2:20])
plt.title('Number of properties sold by top agents')
plt.xlabel('Count');
plt.barh(y=clean[clean.price>0].groupby(['agent_name']).price.mean().sort_values(ascending=False)[:20].index,
         width=clean[clean.price>0].groupby(['agent_name']).price.mean().sort_values(ascending=False)[:20]);
plt.title('Average property price sold by top agents')
plt.xlabel('Price');
This chart may not be the best representation of real estate agents: by selling a single property at a very high price an agent can top this chart. I believe the number of properties sold is the better estimator.
plt.barh(y= clean.suburb.value_counts()[:20].index, width=clean.suburb.value_counts()[:20])
plt.title('Number of properties sold in top suburbs')
plt.xlabel('Count');
The top 3 most active suburbs based on the number of properties sold are all in the same area in south eastern Melbourne suburbs. It raises the question, what was or is going on in these suburbs? Are they being developed? What makes them so special that so many properties are being sold there?
Perhaps later on I can make a heat map on top of the actual map of Melbourne to visualise and connect the numbers to actual locations on the map.
sb.countplot(data=clean, x='month', color='gray')
plt.xlabel('Month');
plt.title('Distribution of sold properties by month');
plt.bar(x=clean[clean.price>0].year.value_counts().index.sort_values()[-12:],
height=clean[clean.price>0].groupby(['year']).price.mean()[-12:])
plt.xticks(np.arange(2008,2020,1),rotation=45);
plt.title('Average property price');
plt.bar(x=clean[clean.price>0].year.value_counts().index.sort_values()[-12:],
height=clean[clean.price>0].groupby(['year']).price.median()[-12:])
plt.xticks(np.arange(2008,2020,1),rotation=45);
plt.title('Median property price');
The chart below gives very interesting insight into price fluctuations during the year: average and median property prices are both significantly lower in January each year. Compared with March, two months later, prices are roughly 20% cheaper, which is huge.
tt1 = clean[(clean.price>0)].query('sold_date>"2007-12-31" & sold_date<"2019-01-01"').copy()
fig = plt.figure(figsize=(15, 4))
ax = fig.add_subplot(111)
ax.plot(tt1.groupby(['month']).price.median());
ax.plot(tt1.groupby(['month']).price.mean());
plt.title('Median and Average property prices during the year (2008-2018)')
plt.legend(['Median','Average'])
plt.xticks(np.arange(1,13,1),['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']);
Comparing price trends between two or more subsets can give us excellent insight, so in the following charts I'll compare subsets of the dataset that I think are interesting.
tt1 = clean[(clean.price>0) &(clean.property_type=='House')].query('sold_date>"2007-12-31"').copy()
tt2 = clean[(clean.price>0) &(clean.property_type=='Townhouse')].query('sold_date>"2007-12-31"').copy()
tt3 = clean[(clean.price>0) &(clean.property_type=='Unit')].query('sold_date>"2007-12-31"').copy()
tt4 = clean[(clean.price>0) &(clean.property_type=='Apartment')].query('sold_date>"2007-12-31"').copy()
grouped1 = tt1.set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped2 = tt2.set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped3 = tt3.set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped4 = tt4.set_index('sold_date').groupby(pd.Grouper(freq='M'))
fig = plt.figure(figsize=(15, 4))
ax = fig.add_subplot(111)
ax.plot(grouped1.price.count().dropna());
ax.plot(grouped2.price.count().dropna());
ax.plot(grouped3.price.count().dropna());
ax.plot(grouped4.price.count().dropna());
plt.title('Number of sold properties')
plt.legend(['House','Townhouse','Unit','Apartment']);
tt1 = clean[(clean.price>0) &(clean.property_type=='House')].query('sold_date>"2007-12-31"').copy()
tt2 = clean[(clean.price>0) &(clean.property_type=='Townhouse')].query('sold_date>"2007-12-31"').copy()
tt3 = clean[(clean.price>0) &(clean.property_type=='Unit')].query('sold_date>"2007-12-31"').copy()
tt4 = clean[(clean.price>0) &(clean.property_type=='Apartment')].query('sold_date>"2007-12-31"').copy()
fig = plt.figure(figsize=(15, 4))
ax = fig.add_subplot(111)
ax.plot(tt1.groupby(['month']).price.mean());
ax.plot(tt2.groupby(['month']).price.mean());
ax.plot(tt3.groupby(['month']).price.mean());
ax.plot(tt4.groupby(['month']).price.mean());
plt.title('Average property prices during the year (2008-2019) based on property type')
plt.legend(['House','Townhouse','Unit','Apartment'])
plt.xticks(np.arange(1,13,1),['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']);
First, let's look at the price trend between two suburbs that I know and have lived in before.
tt1 = clean[(clean.price>0) & (clean.suburb=='Melton')].query('sold_date>"2007-12-31"').copy()
tt2 = clean[(clean.price>0) & (clean.suburb=='St Albans')].query('sold_date>"2007-12-31"').copy()
grouped1 = tt1.set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped2 = tt2.set_index('sold_date').groupby(pd.Grouper(freq='M'))
fig = plt.figure(figsize=(15, 4))
ax = fig.add_subplot(111)
ax.plot(grouped1.price.mean().dropna());
ax.plot(grouped2.price.mean().dropna());
plt.legend(['Melton','St Albans']);
The number of houses sold is far larger than for any other residential property type, and it has grown over the years.
tt1 = clean[(clean.property_type=='House')].query('sold_date>"2007-12-31"').copy()
tt2 = clean[(clean.property_type=='Unit')].query('sold_date>"2007-12-31"').copy()
tt3 = clean[(clean.property_type=='Townhouse')].query('sold_date>"2007-12-31"').copy()
tt4 = clean[(clean.property_type=='Apartment')].query('sold_date>"2007-12-31"').copy()
grouped1 = tt1.set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped2 = tt2.set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped3 = tt3.set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped4 = tt4.set_index('sold_date').groupby(pd.Grouper(freq='M'))
fig = plt.figure(figsize=(15, 4))
ax = fig.add_subplot(111)
ax.plot(grouped1.price.count().dropna());
ax.plot(grouped2.price.count().dropna());
ax.plot(grouped3.price.count().dropna());
ax.plot(grouped4.price.count().dropna());
plt.title('Number of sold properties')
plt.legend(['House','Unit', 'Townhouse','Apartment']);
It appears Units, Townhouses and Houses are pretty correlated.
tt1 = clean[(clean.price>0) & (clean.property_type=='House')].query('sold_date>"2007-12-31"').copy()
tt2 = clean[(clean.price>0) & (clean.property_type=='Unit')].query('sold_date>"2007-12-31"').copy()
tt3 = clean[(clean.price>0) & (clean.property_type=='Townhouse')].query('sold_date>"2007-12-31"').copy()
tt4 = clean[(clean.price>0) & (clean.property_type=='Apartment')].query('sold_date>"2007-12-31"').copy()
grouped1 = tt1.set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped2 = tt2.set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped3 = tt3.set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped4 = tt4.set_index('sold_date').groupby(pd.Grouper(freq='M'))
fig = plt.figure(figsize=(15, 4))
ax = fig.add_subplot(111)
ax.plot(grouped1.price.median().dropna());
ax.plot(grouped2.price.median().dropna());
ax.plot(grouped3.price.median().dropna());
ax.plot(grouped4.price.median().dropna());
plt.title('Median price of different property types')
plt.legend(['House','Unit', 'Townhouse','Apartment']);
The chart below shows the average price of different property types as the number of bedrooms changes. For 2-bedroom properties, all major property types are in the same ballpark, except Units which are slightly cheaper on average. It is in 3-bedroom properties that the averages really diverge: the average price is highest for Apartments, then Townhouses, then Houses and finally Units.
tt1 = clean[(clean.price>0) & (clean.property_type=='House')].query('sold_date>"2007-12-31" & sold_date<"2019-01-01"').copy()
tt2 = clean[(clean.price>0) & (clean.property_type=='Unit')].query('sold_date>"2007-12-31" & sold_date<"2019-01-01"').copy()
tt3 = clean[(clean.price>0) & (clean.property_type=='Townhouse')].query('sold_date>"2007-12-31" & sold_date<"2019-01-01"').copy()
tt4 = clean[(clean.price>0) & (clean.property_type=='Apartment')].query('sold_date>"2007-12-31" & sold_date<"2019-01-01"').copy()
grouped1 = tt1[tt1.beds==2].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped2 = tt2[tt2.beds==2].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped3 = tt3[tt3.beds==2].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped4 = tt4[tt4.beds==2].set_index('sold_date').groupby(pd.Grouper(freq='M'))
fig = plt.figure(figsize=(15, 12))
ax = fig.add_subplot(311)
ax.plot(grouped1.price.mean().dropna());
ax.plot(grouped2.price.mean().dropna());
ax.plot(grouped3.price.mean().dropna());
ax.plot(grouped4.price.mean().dropna());
plt.title('2 bedrooms')
plt.legend(['House','Unit', 'Townhouse', 'Apartment']);
grouped1 = tt1[tt1.beds==3].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped2 = tt2[tt2.beds==3].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped3 = tt3[tt3.beds==3].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped4 = tt4[tt4.beds==3].set_index('sold_date').groupby(pd.Grouper(freq='M'))
az = fig.add_subplot(312)
az.plot(grouped1.price.mean().dropna());
az.plot(grouped2.price.mean().dropna());
az.plot(grouped3.price.mean().dropna());
az.plot(grouped4.price.mean().dropna());
plt.title('3 bedrooms')
plt.legend(['House','Unit', 'Townhouse', 'Apartment']);
grouped1 = tt1[tt1.beds==4].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped2 = tt2[tt2.beds==4].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped3 = tt3[tt3.beds==4].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped4 = tt4[tt4.beds==4].set_index('sold_date').groupby(pd.Grouper(freq='M'))
ac = fig.add_subplot(313)
ac.plot(grouped1.price.mean().dropna());
ac.plot(grouped2.price.mean().dropna());
ac.plot(grouped3.price.mean().dropna());
ac.plot(grouped4.price.mean().dropna());
plt.title('4 bedrooms')
plt.legend(['House','Unit', 'Townhouse', 'Apartment']);
tt1 = clean[(clean.price>0) & (clean.beds==1)].query('sold_date>"2007-12-31"').copy()
tt2 = clean[(clean.price>0) & (clean.beds==2)].query('sold_date>"2007-12-31"').copy()
tt3 = clean[(clean.price>0) & (clean.beds==3)].query('sold_date>"2007-12-31"').copy()
tt4 = clean[(clean.price>0) & (clean.beds==4)].query('sold_date>"2007-12-31"').copy()
grouped1 = tt1[tt1.property_type=='House'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped2 = tt2[tt2.property_type=='House'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped3 = tt3[tt3.property_type=='House'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped4 = tt4[tt4.property_type=='House'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
fig = plt.figure(figsize=(15, 12))
ax = fig.add_subplot(411)
ax.plot(grouped1.price.mean().dropna());
ax.plot(grouped2.price.mean().dropna());
ax.plot(grouped3.price.mean().dropna());
ax.plot(grouped4.price.mean().dropna());
plt.title('House prices based on number of bedrooms')
plt.legend(['1','2', '3', '4']);
grouped1 = tt1[tt1.property_type=='Townhouse'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped2 = tt2[tt2.property_type=='Townhouse'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped3 = tt3[tt3.property_type=='Townhouse'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped4 = tt4[tt4.property_type=='Townhouse'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
az = fig.add_subplot(412)
az.plot(grouped1.price.mean().dropna());
az.plot(grouped2.price.mean().dropna());
az.plot(grouped3.price.mean().dropna());
az.plot(grouped4.price.mean().dropna());
plt.title('Townhouse prices based on number of bedrooms')
plt.legend(['1','2', '3', '4']);
grouped1 = tt1[tt1.property_type=='Unit'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped2 = tt2[tt2.property_type=='Unit'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped3 = tt3[tt3.property_type=='Unit'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped4 = tt4[tt4.property_type=='Unit'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
ac = fig.add_subplot(413)
ac.plot(grouped1.price.mean().dropna());
ac.plot(grouped2.price.mean().dropna());
ac.plot(grouped3.price.mean().dropna());
ac.plot(grouped4.price.mean().dropna());
plt.title('Unit prices based on number of bedrooms')
plt.legend(['1','2', '3', '4']);
grouped1 = tt1[tt1.property_type=='Apartment'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped2 = tt2[tt2.property_type=='Apartment'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped3 = tt3[tt3.property_type=='Apartment'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
grouped4 = tt4[tt4.property_type=='Apartment'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
av = fig.add_subplot(414)
av.plot(grouped1.price.mean().dropna());
av.plot(grouped2.price.mean().dropna());
av.plot(grouped3.price.mean().dropna());
av.plot(grouped4.price.mean().dropna());
plt.title('Apartment prices based on number of bedrooms')
plt.legend(['1','2', '3', '4']);
I'm going to calculate the average price increase in all suburbs between 2013 and 2018 and then pick the fastest growing ones to plot, since comparing the suburbs visually chart by chart would be challenging.
tem = pd.DataFrame()
tem['name'] = clean.groupby(['suburb']).address.count().index
dd = tem['name'].apply(lambda x : clean[clean.suburb==x].property_type.value_counts())
dd = dd.fillna(0)
tem['count_house'] = dd.House
tem['count_townhouse'] = dd.Townhouse
tem['count_unit'] = dd.Unit
tem['count_apartment'] = dd.Apartment
b = clean[(clean.price>0) & (clean.property_type=='Unit')].query('sold_date>"2007-12-31"').copy()
g = b.set_index('sold_date').groupby(pd.Grouper(freq='Y'))
fig = plt.figure(figsize=(15, 12))
ax = fig.add_subplot(411)
ax.plot(g.price.mean().dropna());
a = tem[tem.count_townhouse > 300]
for i in a['name']:
    b = clean[(clean.price>0) & (clean.property_type=='Townhouse') & (clean.suburb == i)].query('sold_date>"2007-12-31"').copy()
    g = b.set_index('sold_date').groupby(pd.Grouper(freq='Y'))
    fig = plt.figure(figsize=(15, 12))
    ax = fig.add_subplot(411)
    ax.plot(g.price.mean().dropna());
    plt.title(i)
a = tem[tem.count_unit > 500]
for i in a['name']:
    b = clean[(clean.price>0) & (clean.property_type=='Unit') & (clean.suburb == i)].query('sold_date>"2007-12-31"').copy()
    g = b.set_index('sold_date').groupby(pd.Grouper(freq='Y'))
    fig = plt.figure(figsize=(15, 12))
    ax = fig.add_subplot(411)
    ax.plot(g.price.mean().dropna());
    plt.title(i)
a = tem[tem.count_apartment > 500]
for i in a['name']:
    b = clean[(clean.price>0) & (clean.property_type=='Apartment') & (clean.suburb == i)].query('sold_date>"2007-12-31"').copy()
    g = b.set_index('sold_date').groupby(pd.Grouper(freq='Y'))
    fig = plt.figure(figsize=(15, 12))
    ax = fig.add_subplot(411)
    ax.plot(g.price.mean().dropna());
    plt.title(i)
I'll make a function to make it easier to check out the price trends in different suburbs.
def plott(aa, bb, cc, dd):
    tt1 = clean[(clean.price>0) & (clean.suburb==aa)].query('sold_date>"2007-12-31"').copy()
    tt2 = clean[(clean.price>0) & (clean.suburb==bb)].query('sold_date>"2007-12-31"').copy()
    tt3 = clean[(clean.price>0) & (clean.suburb==cc)].query('sold_date>"2007-12-31"').copy()
    tt4 = clean[(clean.price>0) & (clean.suburb==dd)].query('sold_date>"2007-12-31"').copy()
    grouped1 = tt1[tt1.property_type=='House'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    grouped2 = tt2[tt2.property_type=='House'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    grouped3 = tt3[tt3.property_type=='House'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    grouped4 = tt4[tt4.property_type=='House'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    fig = plt.figure(figsize=(15, 12))
    ax = fig.add_subplot(411)
    ax.plot(grouped1.price.mean().dropna())
    ax.plot(grouped2.price.mean().dropna())
    ax.plot(grouped3.price.mean().dropna())
    ax.plot(grouped4.price.mean().dropna())
    plt.title('House')
    plt.legend([aa,bb,cc,dd])
    grouped1 = tt1[tt1.property_type=='Unit'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    grouped2 = tt2[tt2.property_type=='Unit'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    grouped3 = tt3[tt3.property_type=='Unit'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    grouped4 = tt4[tt4.property_type=='Unit'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    az = fig.add_subplot(412)
    az.plot(grouped1.price.mean().dropna())
    az.plot(grouped2.price.mean().dropna())
    az.plot(grouped3.price.mean().dropna())
    az.plot(grouped4.price.mean().dropna())
    plt.title('Unit')
    plt.legend([aa,bb,cc,dd])
    grouped1 = tt1[tt1.property_type=='Townhouse'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    grouped2 = tt2[tt2.property_type=='Townhouse'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    grouped3 = tt3[tt3.property_type=='Townhouse'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    grouped4 = tt4[tt4.property_type=='Townhouse'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    ac = fig.add_subplot(413)
    ac.plot(grouped1.price.mean().dropna())
    ac.plot(grouped2.price.mean().dropna())
    ac.plot(grouped3.price.mean().dropna())
    ac.plot(grouped4.price.mean().dropna())
    plt.title('Townhouse')
    plt.legend([aa,bb,cc,dd])
    grouped1 = tt1[tt1.property_type=='Apartment'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    grouped2 = tt2[tt2.property_type=='Apartment'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    grouped3 = tt3[tt3.property_type=='Apartment'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    grouped4 = tt4[tt4.property_type=='Apartment'].set_index('sold_date').groupby(pd.Grouper(freq='M'))
    av = fig.add_subplot(414)
    av.plot(grouped1.price.mean().dropna())
    av.plot(grouped2.price.mean().dropna())
    av.plot(grouped3.price.mean().dropna())
    av.plot(grouped4.price.mean().dropna())
    plt.title('Apartment')
    plt.legend([aa,bb,cc,dd])
aa=clean.suburb.value_counts()[:20].index[0]
bb=clean.suburb.value_counts()[:20].index[1]
cc=clean.suburb.value_counts()[:20].index[2]
dd=clean.suburb.value_counts()[:20].index[4]
plott(aa,bb,cc,dd)
aa='Delacombe'
bb='Lake Gardens'
cc='Wendouree'
dd='Alfredton'
plott(aa,bb,cc,dd)
To compare suburbs with each other I'll pick the busiest suburbs by number of properties sold, then calculate the change in median property price each year and average these changes, giving higher weight to recent years.
def average_property_growth(sub, ptype):
    return clean[(clean.suburb==sub) & (clean.price>10000) & (clean.property_type==ptype)].query(
        'sold_date>"2007-12-31" & sold_date < "2019-01-01"').groupby(
        ['year']).price.median().pct_change()

def average_property_growth_alltypes(sub):
    return clean[(clean.suburb==sub) & (clean.price>10000)].query(
        'sold_date>"2007-12-31" & sold_date < "2019-01-01"').groupby(
        ['year']).price.median().pct_change()
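To show what these functions return, here's `pct_change` on a hypothetical series of yearly medians: each value is the growth rate relative to the previous year, with NaN for the first year.

```python
import pandas as pd

# Hypothetical yearly median prices for one suburb.
medians = pd.Series([400000, 440000, 462000], index=[2016, 2017, 2018])
growth = medians.pct_change()
print(growth.round(2).tolist())  # [nan, 0.1, 0.05]
```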
top100 = clean.groupby(['suburb']).address.count().sort_values(ascending=False)[:100]
# DataFrame.append was removed in pandas 2.0; collect the rows and build the frame once.
rows = []
for sub in top100.index:
    rows.append({'name': sub, 'growth': average_property_growth_alltypes(sub)})
subg2 = pd.DataFrame(rows)
subg2['mean_growth']=subg2.growth.apply(lambda x : (x.values[1:]*[1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2][-len(x.values[1:]):]).mean())
subg2.sort_values(by='mean_growth',ascending=False)[:10]
subg2.to_csv('top100_suburb_growth_all.csv',index=False)
plt.barh(y=subg2.sort_values(by='mean_growth',ascending=False)[:10].name,
width=subg2.sort_values(by='mean_growth',ascending=False)[:10].mean_growth)
plt.xticks(rotation=0)
plt.title('Average of annual median property price change in top 100 most traded suburbs');
Now let's do the same for each property type separately. I'll check the top 100 suburbs for houses and the top 50 suburbs for the other 3 property types.
To make the comparison fair, I'll give increasing weight to more recent years.
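The weighting itself is just an elementwise product with a linearly increasing weight vector, trimmed from the left when a suburb has fewer years of data; a sketch on hypothetical growth rates:

```python
import numpy as np

# Hypothetical annual growth rates for one suburb, oldest year first.
growth = np.array([0.02, 0.05, -0.01, 0.03, 0.06, 0.08, 0.04, 0.07, 0.10, 0.03])

# Linearly increasing weights so recent years count more; the [-len(growth):]
# slice trims the weight vector for suburbs with shorter histories.
weights = np.array([1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2])[-len(growth):]
weighted_mean = (growth * weights).mean()
print(round(weighted_mean, 4))
```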
# Build the rows first; DataFrame.append was removed in pandas 2.0.
rows = []
for sub in tem.sort_values(by='count_house',ascending=False)[:100].name:
    rows.append({'name': sub, 'house_growth': average_property_growth(sub,'House')})
for sub in tem.sort_values(by='count_townhouse',ascending=False)[:50].name:
    rows.append({'name': sub, 'townhouse_growth': average_property_growth(sub,'Townhouse')})
for sub in tem.sort_values(by='count_unit',ascending=False)[:50].name:
    rows.append({'name': sub, 'unit_growth': average_property_growth(sub,'Unit')})
for sub in tem.sort_values(by='count_apartment',ascending=False)[:50].name:
    rows.append({'name': sub, 'apartment_growth': average_property_growth(sub,'Apartment')})
subg = pd.DataFrame(rows)
subg['h_mean']=subg.house_growth[:100].apply(lambda x : (x[1:]*[1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2][-len(x.values[1:]):]).mean())
subg['t_mean']=subg.townhouse_growth[100:150].apply(lambda x : (x[1:]*[1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2][-len(x.values[1:]):]).mean())
subg['u_mean']=subg.unit_growth[150:200].apply(lambda x : (x[1:]*[1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2][-len(x.values[1:]):]).mean())
subg['a_mean']=subg.apartment_growth[200:250].apply(lambda x : (x[1:]*[1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2][-len(x.values[1:]):]).mean())
subg.sort_values(by='t_mean',ascending=False)
fig = plt.figure(figsize=(16, 16))
panels = [('house_growth', 'h_mean', 'Houses'),
          ('townhouse_growth', 't_mean', 'Townhouses'),
          ('unit_growth', 'u_mean', 'Units'),
          ('apartment_growth', 'a_mean', 'Apartments')]
for k, (growth_col, mean_col, label) in enumerate(panels):
    ax = fig.add_subplot(4, 1, k + 1)
    top4 = subg.sort_values(by=mean_col, ascending=False)[:4]
    for idx in top4.index:
        ax.plot(subg[growth_col][idx][1:])
    ax.set_title(f'Annual growth rate of {label} in the top 4 suburbs')
    ax.legend(top4.name.values)
It would also be interesting to know how real estate agencies perform in different suburbs and whether or not some of them are more successful than the others. To do this, I'll group the sales by suburb and agency, then compare the sold prices of a specific property type in a specific suburb, such as a 3-bedroom house. Narrowing down the dataset too much would leave too few rows in each group to get a reliable answer.
agencies = clean.groupby(['agency']).address.count().sort_values(ascending=False)[:50]
suburbs = clean.groupby(['suburb']).address.count().sort_values(ascending=False)[:100]
clean[(clean.suburb=='Pakenham') & (clean.agency=='Ray White') & (clean.price>0) & (clean.year>2017)]
# DataFrame.append is deprecated in newer pandas; collect the rows in a
# list and build the frame once at the end.
rows = []
for i in suburbs.index:
    for j in agencies.index:
        a = clean[(clean.agency == j) & (clean.suburb == i) & (clean.price > 0) & (clean.year == 2018)]
        row = {'suburb': i, 'agency': j}
        # Mean 2018 sold price per property type and bedroom count,
        # e.g. 'h3' = 3-bedroom houses, 'u2' = 2-bedroom units.
        for ptype, code, bed_counts in [('House', 'h', range(2, 6)),
                                        ('Townhouse', 't', range(1, 5)),
                                        ('Unit', 'u', range(1, 5)),
                                        ('Apartment', 'a', range(1, 4))]:
            for b in bed_counts:
                row[f'{code}{b}'] = a[(a.property_type == ptype) & (a.beds == b)].price.mean()
        rows.append(row)
new_2018 = pd.DataFrame(rows)
new_2018.to_csv('suburb_agency_2018.csv')
new_2018
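As a design note, the same suburb-by-agency price table can be built without a nested loop using `pivot_table`. A minimal sketch on a few hypothetical rows (the columns come out as a `(property_type, beds)` MultiIndex rather than the `h3`-style codes used above):

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the cleaned dataset.
clean = pd.DataFrame({
    'suburb': ['Pakenham', 'Pakenham', 'Pakenham', 'Berwick'],
    'agency': ['Ray White', 'Ray White', 'Barry Plant', 'Ray White'],
    'property_type': ['House', 'House', 'Unit', 'House'],
    'beds': [3, 3, 2, 4],
    'price': [600000, 640000, 400000, 750000],
    'year': [2018, 2018, 2018, 2018],
})

sales_2018 = clean[(clean.price > 0) & (clean.year == 2018)]
table = sales_2018.pivot_table(index=['suburb', 'agency'],
                               columns=['property_type', 'beds'],
                               values='price', aggfunc='mean')
print(table.loc[('Pakenham', 'Ray White'), ('House', 3)])  # → 620000.0
```

Combinations with no sales simply come out as NaN, the same behavior `.mean()` gives on an empty selection in the loop version.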